Skip to content

[recipe][perf] feat: unified flat performance recipes for all model families#2803

Open
yaoyu-33 wants to merge 54 commits into
mainfrom
yuya/unified-perf-recipes
Open

[recipe][perf] feat: unified flat performance recipes for all model families#2803
yaoyu-33 wants to merge 54 commits into
mainfrom
yuya/unified-perf-recipes

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Mar 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Consolidate performance benchmark recipes from the old two-level pipeline (scripts/performance/configs/) into self-contained flat recipes under src/megatron/bridge/recipes/<family>/<model>_perf.py.

Each perf recipe function is fully self-contained: it calls the base pretrain/SFT recipe, overrides parallelism and precision inline, then calls _benchmark_common(). This eliminates the old indirection through WorkloadBaseConfig + set_workload_base_configs + get_perf_optimized_recipe.

Models covered

  • Llama 3 / 3.1 — pretrain, SFT, LoRA (8B/70B/405B × H100/B200/B300/GB200/GB300)
  • DeepSeek V3 — pretrain V1+V2 (H100/B200/B300/GB200/GB300)
  • Qwen3 MoE — pretrain (30B-A3B / 235B-A22B)
  • Qwen3-VL — pretrain (30B-A3B / 235B-A22B)
  • Kimi K2 — pretrain (H100/B200/GB200/GB300)
  • NemotronH — pretrain (Nemotron 3 Nano + NemotronH 56B)
  • GPT-OSS — pretrain (120B)

Key changes

  • Add _benchmark_common() helper in common.py for shared perf overrides (train_iters, timing, te_rng_tracker derived from cuda_graph_impl)
  • Add _perf_precision() helper for bf16/fp8_cs/fp8_mx/nvfp4 mixed precision configs
  • Add Qwen3-VL pretrain base recipes (qwen3_vl_30b_a3b_pretrain_config, qwen3_vl_235b_a22b_pretrain_config) delegating to text-only configs
  • Remove 10 DeepSeek V3 recipes that have no legacy equivalent (fp8mx/nvfp4 for H100, nvfp4 for B200/B300/GB200 — unverifiable)
  • Update run_script.py to support new flat recipe discovery path
  • Add comprehensive equivalence test covering all 245 old↔new recipe pairs

Verification

All 245 recipe pairs pass config parity equivalence test on the cluster:

245 passed, 0 failed in 157.61s

Test plan

  • tests/unit_tests/recipes/test_all_perf_equivalence.py — 245/245 pass (old vs new config parity)
  • tests/unit_tests/recipes/llama/test_llama3_perf.py — Llama3 perf recipe unit tests
  • CI unit tests
  • Manual smoke test: run a perf benchmark with a new flat recipe on cluster

Made with Cursor

Summary by CodeRabbit

Release Notes

  • New Features

    • Added performance recipe configurations for DeepSeek V3, Llama3/3.1, GPT-OSS, Kimi K2, Qwen3/Qwen3-VL, and NemotronH across multiple GPU types (H100, B200, B300, GB200, GB300) and precisions (BF16, FP8, NVFP4).
    • Enhanced performance script to support new flat recipe format with automatic fallback to legacy configurations.
  • Documentation

    • Added performance recipe parity validation report documenting test results and identified configuration differences.
  • Tests

    • Added comprehensive equivalence tests validating compatibility between legacy and new performance recipe implementations.

@copy-pr-bot

copy-pr-bot Bot commented Mar 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33 yaoyu-33 added feature New capabilities, enhancements, or enablement work area:recipe Training recipes and launch configs labels Mar 16, 2026
@yaoyu-33 yaoyu-33 marked this pull request as ready for review March 19, 2026 01:06
@yaoyu-33 yaoyu-33 requested review from a team, erhoo82 and malay-nagda as code owners March 19, 2026 01:06
Comment thread docs/proposals/validate-perf-recipe-parity.md Outdated
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread tests/unit_tests/recipes/test_all_perf_equivalence.py Outdated
Comment thread src/megatron/bridge/recipes/deepseek/__init__.py Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good — clean consolidation of the perf recipes. Two items:

  1. Bug in equivalence test: _assert_configs_equal uses is not (identity check) instead of != for tp_comm_overlap_cfg comparison. This is a latent bug that will produce false failures once any recipe sets a non-None tp_comm_overlap_cfg on both old and new paths.

  2. Stale validation report: docs/proposals/validate-perf-recipe-parity.md shows 107/255 passing with 116 mismatches, but the PR description says 245/245 pass. The doc captures a pre-fix snapshot — consider updating it to reflect the final state, or noting prominently that it's a historical record, so future readers don't think half the recipes are broken.

@coderabbitai

coderabbitai Bot commented Mar 19, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This pull request introduces a comprehensive "flat perf recipes" framework for performance benchmarking across multiple model families. It adds convention-based recipe functions to replace legacy pipeline configs, implements a new recipe-loading mechanism in the performance runner, establishes a common benchmarking utility, and provides extensive test coverage to validate equivalence between old and new implementations.

Changes

Cohort / File(s) Summary
Documentation & Validation
docs/proposals/validate-perf-recipe-parity.md
New validation report documenting perf recipe parity test results across 8 model families, including passed equivalence cases, mismatches, observed patterns, and remediation checklist.
Core Performance Infrastructure
scripts/performance/run_script.py, src/megatron/bridge/recipes/common.py
Added get_perf_recipe_by_name() function to load flat recipes via convention-based naming; updated main() to attempt flat recipe lookup before falling back to legacy path. Added _benchmark_common() helper to apply standardized benchmark-focused config overrides (train iters, checkpointing, validation, gradient checks, CUDA graphs, etc.).
DeepSeek V3 Perf Recipes
src/megatron/bridge/recipes/deepseek/deepseek_v3_perf.py, src/megatron/bridge/recipes/deepseek/__init__.py
New 1024-line recipe module defining 32 configuration factories for DeepSeek V3 pretraining across GB300/GB200/B300/B200/H100 with bf16/fp8cs/fp8mx/nvfp4 precision; updated init.py to export all variants.
GPT-OSS Perf Recipes
src/megatron/bridge/recipes/gpt_oss/gpt_oss_perf.py, src/megatron/bridge/recipes/gpt_oss/__init__.py
New 214-line recipe module for GPT-OSS 120B pretraining on 64 GPUs across B200/B300/GB200/GB300/H100; 10 configs per variant (V1/V2); exports updated in init.py.
Kimi K2 Perf Recipes
src/megatron/bridge/recipes/kimi/kimi_k2_perf.py, src/megatron/bridge/recipes/kimi/__init__.py
New 456-line recipe module defining 12 Kimi K2 configurations across 256/1024-GPU variants and BF16/FP8 precisions; package init.py established with full export list.
Llama Perf Recipes
src/megatron/bridge/recipes/llama/llama3_perf.py, src/megatron/bridge/recipes/llama/llama31_perf.py, src/megatron/bridge/recipes/llama/__init__.py
New 2152-line Llama3 module covering 8B/70B pretraining, SFT, and LoRA across multiple GPUs/precisions; new 801-line Llama3.1 405B module for pretraining variants; init.py updated with 100+ export symbols.
NemotronH Perf Recipes
src/megatron/bridge/recipes/nemotronh/nemotronh_perf.py, src/megatron/bridge/recipes/nemotronh/__init__.py
New 554-line recipe module for NemotronH 56B (64-GPU) and Nemotron 3 Nano (8/16-GPU) across hardware/precision variants; exports updated in init.py.
Qwen3 MoE Perf Recipes
src/megatron/bridge/recipes/qwen/qwen3_moe_perf.py, src/megatron/bridge/recipes/qwen/__init__.py
New 1502-line recipe module with 44 configs for Qwen3 235B-A22B and 30B-A3B pretraining across GPU types and precisions (including V2 variants); all extended with complete symbol list.
Qwen3-VL Perf Recipes
src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py, src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py, src/megatron/bridge/recipes/qwen_vl/__init__.py
Added two pretrain config helpers in qwen3_vl.py; new 831-line qwen3_vl_perf.py with 22 recipe factories spanning 8/64/256-GPU variants; updated init.py exports.
Equivalence Test Coverage
tests/unit_tests/recipes/test_all_perf_equivalence.py, tests/unit_tests/recipes/llama/test_llama3_perf.py
New comprehensive 412-line test file validating parity across all model families using parametrized test classes; dedicated 319-line Llama3 test module with fine-grained equivalence assertions on config fields (model, train, DDP, scheduler, mixed-precision, comm-overlap, etc.).

Sequence Diagram(s)

sequenceDiagram
    participant User as Performance Runner
    participant Script as run_script.py
    participant FlatRecipe as Flat Recipe Module
    participant Legacy as Legacy Pipeline
    participant Common as common._benchmark_common()
    
    User->>Script: main(model_recipe_name, task, config_variant, ...)
    Script->>Script: construct recipe function name
    Script->>FlatRecipe: get_perf_recipe_by_name()
    alt Flat recipe found
        FlatRecipe-->>FlatRecipe: load + call recipe function
        FlatRecipe-->>Common: _benchmark_common(cfg)
        Common-->>Common: apply benchmark overrides (iters, logging, checks)
        FlatRecipe-->>Script: return ConfigContainer
        Script->>Script: using_flat_recipe = True
    else Flat recipe not found
        Script->>Legacy: get_perf_optimized_recipe()
        Legacy-->>Script: return ConfigContainer
        Script->>Script: set_post_overrides(cfg)
        Script->>Script: using_flat_recipe = False
    end
    Script-->>User: configured ConfigContainer
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR claims 245 passed tests but evidence shows critical gaps: test function omits dataset/optimizer/dist/peft/profiling validation, documented bugs pass undetected, results contradict earlier parity report showing 116 mismatches, and no CI logs provided. Provide actual CI logs with timestamps, expand test validation to all config sections, fix identified code bugs, and document intentionally excluded fields.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main change: introducing unified flat performance recipes for all model families, consolidating previous two-level pipeline architecture.
Docstring Coverage ✅ Passed Docstring coverage is 94.14% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yuya/unified-perf-recipes
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py (1)

471-477: ⚠️ Potential issue | 🟠 Major

Bug: min_lr is greater than max_lr in the optimizer config.

The learning rate schedule has max_lr=5e-6 and min_lr=3e-5, but min_lr (3e-5) is actually larger than max_lr (5e-6). This will cause unexpected behavior in cosine annealing where the LR increases rather than decreases during decay.

🐛 Proposed fix
     opt_cfg, scheduler_cfg = distributed_fused_adam_with_cosine_annealing(
         lr_warmup_iters=500,
         lr_decay_iters=300000,
         max_lr=5e-6,
-        min_lr=3e-5,
+        min_lr=5e-7,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` around lines 471 - 477, The
optimizer schedule is configured with min_lr (3e-5) greater than max_lr (5e-6)
which inverts the intended cosine annealing behavior; update the call to
distributed_fused_adam_with_cosine_annealing (the assignment to opt_cfg,
scheduler_cfg) so that max_lr is larger than min_lr (e.g., swap the values or
set max_lr to 3e-5 and min_lr to 5e-6) to ensure learning rate decays correctly.
🧹 Nitpick comments (5)
src/megatron/bridge/recipes/gpt_oss/__init__.py (1)

30-44: Minor: Comment placement could be clearer.

The # V1 comment on line 35 appears after gb200 but before gb300, which is inconsistent with the __all__ ordering where gb300 comes first in the V1 section. Consider moving the comment to line 31 for consistency.

📝 Suggested reordering
 # GPT-OSS perf recipes
 from .gpt_oss_perf import (
+    # V1
+    gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
+    gpt_oss_120b_pretrain_64gpu_gb200_bf16_config,
+    gpt_oss_120b_pretrain_64gpu_b300_bf16_config,
     gpt_oss_120b_pretrain_64gpu_b200_bf16_config,
-    gpt_oss_120b_pretrain_64gpu_b300_bf16_config,
-    gpt_oss_120b_pretrain_64gpu_gb200_bf16_config,
-    # V1
-    gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
     gpt_oss_120b_pretrain_64gpu_h100_bf16_config,
+    # V2
+    gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config,
+    gpt_oss_120b_pretrain_v2_64gpu_gb200_bf16_config,
+    gpt_oss_120b_pretrain_v2_64gpu_b300_bf16_config,
     gpt_oss_120b_pretrain_v2_64gpu_b200_bf16_config,
-    gpt_oss_120b_pretrain_v2_64gpu_b300_bf16_config,
-    gpt_oss_120b_pretrain_v2_64gpu_gb200_bf16_config,
-    # V2
-    gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config,
     gpt_oss_120b_pretrain_v2_64gpu_h100_bf16_config,
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/gpt_oss/__init__.py` around lines 30 - 44, Move
the inline section comment so the V1/V2 grouping matches the import order:
relocate the "# V1" comment to immediately before
gpt_oss_120b_pretrain_64gpu_gb300_bf16_config (so it sits above
gpt_oss_120b_pretrain_64gpu_gb300_bf16_config and
gpt_oss_120b_pretrain_64gpu_h100_bf16_config) and ensure the "# V2" comment
remains before gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config; this aligns the
comments with the imports such as gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
gpt_oss_120b_pretrain_64gpu_h100_bf16_config,
gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config, etc.
src/megatron/bridge/recipes/nemotronh/__init__.py (1)

47-74: Minor: Import ordering could be cleaner.

The NemotronH 56B imports (lines 68-73) are placed after the Nemotron 3 Nano H100 imports, making the grouping inconsistent. Consider grouping all NemotronH 56B imports together at the start, followed by all Nemotron 3 Nano imports.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/__init__.py` around lines 47 - 74, The
import block mixes NemotronH 56B symbols with Nemotron 3 Nano H100 entries;
reorder the import list in __init__.py so all NemotronH 56B configs
(nemotronh_56b_pretrain_64gpu_b200_fp8cs_config,
nemotronh_56b_pretrain_64gpu_b300_fp8cs_config,
nemotronh_56b_pretrain_64gpu_gb200_fp8cs_config,
nemotronh_56b_pretrain_64gpu_gb300_fp8cs_config,
nemotronh_56b_pretrain_64gpu_h100_fp8cs_config) are grouped together (e.g., at
the start of the block) and then group all Nemotron 3 Nano configs
(nemotron_3_nano_pretrain_*_config) afterwards to keep imports consistently
organized and readable.
tests/unit_tests/recipes/llama/test_llama3_perf.py (1)

90-101: This parity test does not exercise the new loader path.

The suite compares builder outputs directly via get_perf_optimized_recipe() and new_fn(), so regressions in scripts/performance/run_script.py still go green. The two loader bugs in this PR branch change behavior after recipe selection, and this file never hits that code.

Also applies to: 306-319

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/llama/test_llama3_perf.py` around lines 90 - 101,
The parity test currently compares outputs from get_perf_optimized_recipe and
new_fn directly and therefore never executes the new loader code path; update
the test to invoke the actual loader entrypoint used by the new pipeline (the
function that run_script uses to load and return a recipe) instead of calling
new_fn directly so the loader logic runs; specifically, replace or augment calls
to get_perf_optimized_recipe/new_fn in test_llama3_perf.py (and the similar
block around lines 306-319) to construct a recipe via the new pipeline loader
entrypoint, passing the same model, gpu and precision parameters, then compare
the final loaded recipe objects to exercise the new loader path and catch the
reported regressions.
src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py (1)

25-30: Use the Qwen-VL base entrypoints here.

These imports pull the text-only configs from src/megatron/bridge/recipes/qwen/qwen3_moe.py instead of the family-local wrappers in src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py. Today they delegate to the same base, but keeping the perf recipes on the Qwen-VL entrypoints is what preserves any future VLM-specific defaults or backend handling.

♻️ Suggested direction
-from megatron.bridge.recipes.qwen.qwen3_moe import (
-    qwen3_30b_a3b_pretrain_config,
-    qwen3_235b_a22b_pretrain_config,
-)
+from .qwen3_vl import (
+    qwen3_vl_30b_a3b_pretrain_config,
+    qwen3_vl_235b_a22b_pretrain_config,
+)

Then update the call sites below to use qwen3_vl_*_pretrain_config().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py` around lines 25 - 30,
Replace the direct imports of text-only Qwen configs with the Qwen-VL family
entrypoints: instead of importing qwen3_30b_a3b_pretrain_config and
qwen3_235b_a22b_pretrain_config from qwen.qwen3_moe, import their Qwen-VL
counterparts from qwen_vl.qwen3_vl (use qwen3_vl_30b_a3b_pretrain_config and
qwen3_vl_235b_a22b_pretrain_config). Then update any call sites in this file
that construct or pass configs to use qwen3_vl_*_pretrain_config() (keep
existing uses of _perf_precision and _benchmark_common unchanged).
src/megatron/bridge/recipes/llama/llama3_perf.py (1)

55-71: Promote _perf_precision() to the shared recipe helpers.

src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py already imports this helper, so the precision preset logic now lives behind a Llama-specific module boundary. Moving it alongside _benchmark_common() in src/megatron/bridge/recipes/common.py keeps the dependency direction clean and avoids making other families import the whole Llama perf module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/llama/llama3_perf.py` around lines 55 - 71, Move
the _perf_precision function out of the Llama-specific file into the shared
recipe helpers alongside _benchmark_common in the common recipe module so other
model families can import it without depending on the Llama perf module;
specifically, relocate the function named _perf_precision (and its use-sites
such as imports in qwen3_vl_perf.py) into src/megatron/bridge/recipes/common.py,
preserving the logic that returns bf16_mixed(),
bf16_with_fp8_current_scaling_mixed() (with cfg.first_last_layers_bf16 = False),
bf16_with_mxfp8_mixed(), and bf16_with_nvfp4_mixed(), update any imports that
referenced _perf_precision from llama3_perf.py to import it from the common
helper, and ensure no other Llama-specific symbols are moved or referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/proposals/validate-perf-recipe-parity.md`:
- Around line 7-208: The parity report in validate-perf-recipe-parity.md is
stale (shows 107/255 passed, 116 mismatched, 22 import errors) — either update
the document to reflect the new final result (245 of 245 parity run) by changing
the Summary counts, removing or collapsing obsolete Category 3/4 sections and
the Root Cause/Next Steps that are already addressed, and note the 10 DeepSeek
skips; or delete/archival the file if you don't want historical pre-fix details
in docs; search for the document title "validate-perf-recipe-parity.md" and edit
its Summary, tables, and category sections (and any references to
_benchmark_common(), qwen3_vl_30b_a3b_pretrain_config, DeepSeek entries) to
match the current successful run.

In `@scripts/performance/run_script.py`:
- Around line 31-78: get_perf_recipe_by_name currently builds and calls flat
recipe builders with no arguments so --optimizer-type is ignored for Kimi;
update get_perf_recipe_by_name signature to accept an optimizer_type parameter
and thread it when invoking the recipe function (e.g., call
recipe_fn(optimizer_type=optimizer_type) if the builder accepts it), update main
to pass args.optimizer_type into get_perf_recipe_by_name when using the flat
path, and update the Kimi flat recipe builder functions to accept an
optimizer_type arg and forward it into kimi_k2_pretrain_config(...) so the
selected optimizer is honored.
- Around line 113-124: The code currently skips calling set_post_overrides when
using_flat_recipe is true, which omits important launch-time mutations (e.g.,
BF16 + Adam precision-aware optimizer, overlap_param_gather_with_optimizer_step,
and GPU-count-based GBS scaling). Instead of skipping set_post_overrides
entirely, extract the non-workload-dependent mutations from set_post_overrides
into a small helper (e.g., apply_launch_overrides) and call that helper for both
flat and legacy recipes; keep workload-dependent logic in set_post_overrides and
only skip that part for flat recipes, ensuring recipe still receives
BF16/optimizer/overlap/GBS adjustments even when using_flat_recipe is true.

In `@src/megatron/bridge/recipes/deepseek/deepseek_v3_perf.py`:
- Around line 489-495: The H100 recipe is only disabling
cfg.comm_overlap.overlap_grad_reduce but not cfg.ddp.overlap_grad_reduce, so
overlap will remain enabled due to OR logic; update the H100 recipe blocks in
deepseek_v3_perf.py (the sections that call
set_deepseek_v3_pipeline_model_parallel_layout and then _benchmark_common(cfg))
to also set cfg.ddp.overlap_grad_reduce = False alongside
cfg.comm_overlap.overlap_grad_reduce = False; ensure you add the same assignment
in each H100 recipe that currently only sets
cfg.comm_overlap.overlap_grad_reduce so they match the GB300/B300 pattern and
the base deepseek_v3_pretrain_config override is undone.

In `@src/megatron/bridge/recipes/kimi/kimi_k2_perf.py`:
- Around line 49-50: The dataset sequence-length override is using the wrong
attribute name (cfg.dataset.seq_length) so it has no effect; update all
occurrences in the Kimi perf builders to use cfg.dataset.sequence_length instead
of cfg.dataset.seq_length (leave cfg.model.seq_length as-is), i.e., replace
cfg.dataset.seq_length -> cfg.dataset.sequence_length for each builder
function/config block in kimi_k2_perf.py so the dataset config set in kimi_k2.py
is actually overridden.

In `@tests/unit_tests/recipes/test_all_perf_equivalence.py`:
- Around line 171-224: The parity check in _assert_configs_equal is incomplete:
it skips many fields (dataset, optimizer, dist, peft, profiling and additional
model/comm_overlap fields) so ConfigContainer differences can be missed; update
_assert_configs_equal to compare all relevant sub-objects (dataset, optimizer,
dist, peft, profiling) and missing model/comm_overlap attributes (e.g.,
dataset.seq_length, dataset_kwargs, wgrad_deferral_limit,
moe_flex_dispatcher_backend) by either iterating over their dataclass fields or
delegating to _compare_dataclass for each of these members, and ensure
comm_overlap nested objects (like tp_comm_overlap_cfg) are compared deeply
rather than by identity so that old vs new recipe changes are detected.
- Around line 274-282: The helper _make_cases currently swallows ValueError from
_parse_recipe_name causing misnamed "*_config" recipes to be ignored; change it
to fail fast by removing the try/except or re-raising a descriptive error (or
call pytest.fail) when _parse_recipe_name(name) raises, so malformed recipe
names surface in test failures; update the block around _make_cases and the call
to _parse_recipe_name to propagate the exception with a clear message including
the offending name rather than silently passing.

---

Outside diff comments:
In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py`:
- Around line 471-477: The optimizer schedule is configured with min_lr (3e-5)
greater than max_lr (5e-6) which inverts the intended cosine annealing behavior;
update the call to distributed_fused_adam_with_cosine_annealing (the assignment
to opt_cfg, scheduler_cfg) so that max_lr is larger than min_lr (e.g., swap the
values or set max_lr to 3e-5 and min_lr to 5e-6) to ensure learning rate decays
correctly.

---

Nitpick comments:
In `@src/megatron/bridge/recipes/gpt_oss/__init__.py`:
- Around line 30-44: Move the inline section comment so the V1/V2 grouping
matches the import order: relocate the "# V1" comment to immediately before
gpt_oss_120b_pretrain_64gpu_gb300_bf16_config (so it sits above
gpt_oss_120b_pretrain_64gpu_gb300_bf16_config and
gpt_oss_120b_pretrain_64gpu_h100_bf16_config) and ensure the "# V2" comment
remains before gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config; this aligns the
comments with the imports such as gpt_oss_120b_pretrain_64gpu_gb300_bf16_config,
gpt_oss_120b_pretrain_64gpu_h100_bf16_config,
gpt_oss_120b_pretrain_v2_64gpu_gb300_bf16_config, etc.

In `@src/megatron/bridge/recipes/llama/llama3_perf.py`:
- Around line 55-71: Move the _perf_precision function out of the Llama-specific
file into the shared recipe helpers alongside _benchmark_common in the common
recipe module so other model families can import it without depending on the
Llama perf module; specifically, relocate the function named _perf_precision
(and its use-sites such as imports in qwen3_vl_perf.py) into
src/megatron/bridge/recipes/common.py, preserving the logic that returns
bf16_mixed(), bf16_with_fp8_current_scaling_mixed() (with
cfg.first_last_layers_bf16 = False), bf16_with_mxfp8_mixed(), and
bf16_with_nvfp4_mixed(), update any imports that referenced _perf_precision from
llama3_perf.py to import it from the common helper, and ensure no other
Llama-specific symbols are moved or referenced.

In `@src/megatron/bridge/recipes/nemotronh/__init__.py`:
- Around line 47-74: The import block mixes NemotronH 56B symbols with Nemotron
3 Nano H100 entries; reorder the import list in __init__.py so all NemotronH 56B
configs (nemotronh_56b_pretrain_64gpu_b200_fp8cs_config,
nemotronh_56b_pretrain_64gpu_b300_fp8cs_config,
nemotronh_56b_pretrain_64gpu_gb200_fp8cs_config,
nemotronh_56b_pretrain_64gpu_gb300_fp8cs_config,
nemotronh_56b_pretrain_64gpu_h100_fp8cs_config) are grouped together (e.g., at
the start of the block) and then group all Nemotron 3 Nano configs
(nemotron_3_nano_pretrain_*_config) afterwards to keep imports consistently
organized and readable.

In `@src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py`:
- Around line 25-30: Replace the direct imports of text-only Qwen configs with
the Qwen-VL family entrypoints: instead of importing
qwen3_30b_a3b_pretrain_config and qwen3_235b_a22b_pretrain_config from
qwen.qwen3_moe, import their Qwen-VL counterparts from qwen_vl.qwen3_vl (use
qwen3_vl_30b_a3b_pretrain_config and qwen3_vl_235b_a22b_pretrain_config). Then
update any call sites in this file that construct or pass configs to use
qwen3_vl_*_pretrain_config() (keep existing uses of _perf_precision and
_benchmark_common unchanged).

In `@tests/unit_tests/recipes/llama/test_llama3_perf.py`:
- Around line 90-101: The parity test currently compares outputs from
get_perf_optimized_recipe and new_fn directly and therefore never executes the
new loader code path; update the test to invoke the actual loader entrypoint
used by the new pipeline (the function that run_script uses to load and return a
recipe) instead of calling new_fn directly so the loader logic runs;
specifically, replace or augment calls to get_perf_optimized_recipe/new_fn in
test_llama3_perf.py (and the similar block around lines 306-319) to construct a
recipe via the new pipeline loader entrypoint, passing the same model, gpu and
precision parameters, then compare the final loaded recipe objects to exercise
the new loader path and catch the reported regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 267e13dc-daa7-4963-b049-fa7790d3d428

📥 Commits

Reviewing files that changed from the base of the PR and between acc52e2 and 23571b8.

📒 Files selected for processing (21)
  • docs/proposals/validate-perf-recipe-parity.md
  • scripts/performance/run_script.py
  • src/megatron/bridge/recipes/common.py
  • src/megatron/bridge/recipes/deepseek/__init__.py
  • src/megatron/bridge/recipes/deepseek/deepseek_v3_perf.py
  • src/megatron/bridge/recipes/gpt_oss/__init__.py
  • src/megatron/bridge/recipes/gpt_oss/gpt_oss_perf.py
  • src/megatron/bridge/recipes/kimi/__init__.py
  • src/megatron/bridge/recipes/kimi/kimi_k2_perf.py
  • src/megatron/bridge/recipes/llama/__init__.py
  • src/megatron/bridge/recipes/llama/llama31_perf.py
  • src/megatron/bridge/recipes/llama/llama3_perf.py
  • src/megatron/bridge/recipes/nemotronh/__init__.py
  • src/megatron/bridge/recipes/nemotronh/nemotronh_perf.py
  • src/megatron/bridge/recipes/qwen/__init__.py
  • src/megatron/bridge/recipes/qwen/qwen3_moe_perf.py
  • src/megatron/bridge/recipes/qwen_vl/__init__.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl_perf.py
  • tests/unit_tests/recipes/llama/test_llama3_perf.py
  • tests/unit_tests/recipes/test_all_perf_equivalence.py

Comment thread docs/proposals/validate-perf-recipe-parity.md Outdated
Comment thread scripts/performance/run_script.py Outdated
Comment thread scripts/performance/run_script.py Outdated
Comment thread src/megatron/bridge/perf_recipes/deepseek/deepseek_v3_perf.py
Comment thread src/megatron/bridge/recipes/kimi/kimi_k2_perf.py Outdated
Comment thread tests/unit_tests/recipes/test_all_perf_equivalence.py Outdated
Comment thread tests/unit_tests/recipes/test_all_perf_equivalence.py Outdated
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 3d47abd

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 3859b91

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test eecbe1e

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/claude review

Comment thread tests/unit_tests/recipes/test_all_perf_equivalence.py Outdated
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 79ec86a

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 188a5bb

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

Theo follow-up fix pushed and revalidated.

Validated PR head de96af2963b4fcc52dea7baa3e7f96ba9e989bfd against latest origin/main at validation time. Internal validation run completed successfully at the runner level; detailed logs/artifacts are retained internally.

Summary:

  • Nemotron 3 Super family-map route: fixed. GB300 bf16, fp8_mx, and nvfp4 all generated on PR and matched main after path normalization.
  • Qwen3-VL generated-config parity: fixed for generated targets. The 30B-A3B subset generated on both sides for all 11 targeted configs and now matches main. model.apply_rope_fusion remains False in PR for every generated Qwen3-VL config. The 235B-A22B subset still fails before generation on both main and PR because the public HF config is unavailable in the validation environment; this is not PR-specific.
  • CUDA graph RNG gate: validated as expected. Llama3 8B R100 bf16, fp8_cs, fp8_mx, and nvfp4 all generated on both sides. The only normalized diffs are the intended tracker flips: model.use_te_rng_tracker and rng.te_rng_tracker False -> True when cuda_graph_scope contains full_iteration.

Validation counts: MATCH=14, EXPECTED_DIFF=4, PR_FAIL=11 where the 11 failures are the same Qwen3-VL 235B-A22B environment/cache failures on both main and PR.

No local unit tests were run; this was targeted internal dry-run/config validation plus pre-commit on the changed tree.

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test de96af2

yaoyu-33 and others added 10 commits May 26, 2026 13:30
ConfigContainer.yaml stopped appearing in CI artifacts because Bridge's
maybe_log_and_save_config calls cfg.to_yaml() via plain open(..., "w"),
which fails with FileNotFoundError if the parent directory doesn't exist.
The failure is swallowed by maybe_log_and_save_config's try/except and the
after_script's `cp ... || true`, so the absence is silent.

Both places in set_user_overrides / _set_checkpoint_overrides that assign
recipe.logger.save_config_filepath now also create the parent directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

yaoyu-33 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

PR #2803 Config Dump vs Legacy Run Parity Report

Scope

Compared the finalized legacy performance config path on main against the flat recipe dump path on this PR after merging latest main.

Branches

  • Base: main at 4b68d3e5
  • Feature: yuya/unified-perf-recipes at d2d7eb53

How I Ran It

  1. Synced the remote comparison workspace to origin/main and origin/yuya/unified-perf-recipes.
  2. Used the feature branch scripts/performance/dump_perf_configs.py COMBOS list as the shared comparison matrix.
  3. Dumped old configs by loading the legacy scripts/performance/configs path and applying set_post_overrides(...), matching the actual run finalization path.
  4. Dumped new configs by loading flat recipe functions and applying the remaining BF16 Adam runtime override still owned by run_script.py.
  5. Canonicalized only dump-only cwd-derived checkpoint paths so branch checkout location does not count as a config diff.
  6. Compared YAMLs field-by-field.

Fixes Made

  • Moved optimizer-step parameter gather overlap out of run_script.py and into the affected flat recipes.
  • Matched legacy-scaled train.global_batch_size for GPU-count alias recipes.
  • Updated dump_perf_configs.py so old/new dumps compare finalized configs rather than raw recipe construction.
  • Removed unsupported NemotronH BF16 comparison tuples that do not exist in the legacy path.

Example Diffs Before Fix

  • deepseek_v3_pretrain_256gpu_gb200_bf16: legacy finalized overlap was true, flat recipe had optimizer overlap false and comm overlap null.
  • llama3_8b_pretrain_64gpu_h100_bf16: legacy finalized train.global_batch_size was 1024, flat recipe had 128.

Final Result

old_count=127
new_count=127
missing_in_old=0
missing_in_new=0
identical=127
different=0

No old/new dump failures remained.

Validation

  • python3 -m py_compile ... on changed scripts and recipe files: passed
  • uv run --no-sync pre-commit run --all-files: passed
  • Full remote dump comparison in offline cached environment: passed, 127/127 identical

Artifacts

Comparison artifacts are in the remote scratch workspace under:

  • pr2803_cluster/out_canonical/results/compare_summary.txt
  • pr2803_cluster/out_canonical/results/compare.json
  • pr2803_cluster/out_canonical/results/main_old_v2_failures.tsv
  • pr2803_cluster/out_canonical/results/feature_new_failures.tsv

No tokens or environment-specific paths are included in this report.

yaoyu-33 added 2 commits June 14, 2026 12:08
…ipes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 41622a5

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

PR #2803 Current-Head Config Dump vs Legacy Run Parity

Validated current PR head 41622a51bff106b07e3cbfac68729c8b2b41f45b.

Baseline used for this comparison: the current-head legacy performance path (scripts/performance/configs + set_post_overrides) from the same PR checkout, with the branch merge-base at 83d3fd190033c36b18997e601640c840c9c1edf5. Current origin/main at validation time was eaccbb81c95c6ff5241c9307cff29075a7775729 and was not used as the old-path baseline for this current-head legacy-vs-flat parity check.

Comparison summary:

old_count=127
new_count=127
missing_in_old=0
missing_in_new=0
identical=127
different=0

Gated/inaccessible configs: 0 in this validation. The validation environment had the required HF config artifacts available, so no expected configs were excluded or left incomplete. Failure TSVs for both old and new paths are empty.

Static/check status observed:

Artifacts retained internally under <artifact-root>/out_current_head_self/, including:

  • results/compare_summary.txt
  • results/compare.json
  • results/main_old_v2_failures.tsv
  • results/feature_new_failures.tsv

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test c48ff31

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:recipe Training recipes and launch configs feature New capabilities, enhancements, or enablement work needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants